
Main Datasets (w/ hospitalised data)

Sources:
  • https://covidtracking.com/
  • https://github.com/CSSEGISandData/COVID-19
  • Various state, third-party, and federal data

Combine, validate, and verify data sets.

# See what the filtered main DataFrame looks like (first 50 rows, one per state):
all_cases.head(50)
date state positive active hospitalizedCurrently hospitalizedCumulative inIcuCurrently onVentilatorCurrently recovered dataQualityGrade ... totalTestsViral positiveTestsViral negativeTestsViral positiveCasesViral commercialScore negativeRegularScore negativeScore positiveScore score grade
0 2020-06-30 AK 940.0 400.0 18.0 NaN NaN 1.0 526.0 A ... 112185.0 NaN NaN NaN 0 0 0 0 0 NaN
1 2020-06-30 AL 38045.0 18229.0 776.0 2769.0 NaN NaN 18866.0 B ... NaN NaN NaN 37536.0 0 0 0 0 0 NaN
2 2020-06-30 AR 20777.0 5976.0 290.0 1413.0 NaN 67.0 14531.0 A ... NaN NaN NaN 20777.0 0 0 0 0 0 NaN
4 2020-06-30 AZ 79215.0 68172.0 2793.0 4736.0 683.0 455.0 9411.0 A+ ... 531922.0 NaN NaN 78781.0 0 0 0 0 0 NaN
5 2020-06-30 CA 222917.0 NaN 6466.0 NaN 1751.0 NaN NaN B ... 4167139.0 NaN NaN 222917.0 0 0 0 0 0 NaN
6 2020-06-30 CO 32511.0 26350.0 271.0 5442.0 NaN NaN 4479.0 A ... NaN NaN NaN 29651.0 0 0 0 0 0 NaN
7 2020-06-30 CT 46514.0 34139.0 98.0 10268.0 NaN NaN 8053.0 B ... 464414.0 NaN NaN 44534.0 0 0 0 0 0 NaN
8 2020-06-30 DC 10327.0 8506.0 126.0 NaN 33.0 22.0 1270.0 A+ ... NaN NaN NaN 10327.0 0 0 0 0 0 NaN
9 2020-06-30 DE 11474.0 4298.0 64.0 NaN 13.0 NaN 6667.0 A+ ... NaN NaN NaN 10430.0 0 0 0 0 0 NaN
10 2020-06-30 FL 152434.0 NaN NaN 14879.0 NaN NaN NaN A ... 2337165.0 195894.0 2136944.0 152434.0 0 0 0 0 0 NaN
11 2020-06-30 GA 81291.0 NaN 1459.0 11051.0 NaN NaN NaN A ... 833878.0 74214.0 759664.0 81291.0 0 0 0 0 0 NaN
13 2020-06-30 HI 900.0 160.0 NaN 111.0 NaN NaN 722.0 D ... 90577.0 900.0 89677.0 900.0 0 0 0 0 0 NaN
14 2020-06-30 IA 29007.0 5080.0 133.0 NaN 34.0 20.0 23212.0 A+ ... NaN NaN NaN 29007.0 0 0 0 0 0 NaN
15 2020-06-30 ID 5752.0 1588.0 NaN 322.0 NaN NaN 4073.0 A ... 88763.0 NaN NaN 5212.0 0 0 0 0 0 NaN
16 2020-06-30 IL 144238.0 NaN 1560.0 NaN 401.0 185.0 NaN A ... 1602965.0 NaN NaN 143185.0 0 0 0 0 0 NaN
17 2020-06-30 IN 45594.0 8308.0 695.0 7065.0 272.0 97.0 34646.0 A+ ... NaN NaN NaN 45594.0 0 0 0 0 0 NaN
18 2020-06-30 KS 14443.0 13379.0 NaN 1152.0 NaN NaN 794.0 A ... NaN NaN NaN 14443.0 0 0 0 0 0 NaN
19 2020-06-30 KY 15624.0 11069.0 408.0 2621.0 75.0 NaN 3990.0 B ... NaN NaN NaN 15090.0 0 0 0 0 0 NaN
20 2020-06-30 LA 58095.0 12649.0 781.0 NaN NaN 83.0 42225.0 B ... NaN NaN NaN 58095.0 0 0 0 0 0 NaN
21 2020-06-30 MA 108882.0 NaN 733.0 11337.0 120.0 63.0 NaN A+ ... 1066060.0 NaN NaN 103701.0 0 0 0 0 0 NaN
22 2020-06-30 MD 67559.0 59387.0 452.0 10844.0 152.0 NaN 4982.0 A ... 652701.0 NaN NaN 67559.0 0 0 0 0 0 NaN
23 2020-06-30 ME 3253.0 502.0 29.0 348.0 9.0 4.0 2646.0 A ... 94592.0 3924.0 90532.0 2893.0 0 0 0 0 0 NaN
24 2020-06-30 MI 70728.0 13436.0 471.0 NaN 179.0 98.0 51099.0 A+ ... 1062116.0 87787.0 974329.0 63870.0 0 0 0 0 0 NaN
25 2020-06-30 MN 36303.0 3226.0 270.0 4054.0 136.0 NaN 31601.0 A ... 605316.0 NaN NaN 36303.0 0 0 0 0 0 NaN
26 2020-06-30 MO 21551.0 NaN 599.0 NaN NaN 66.0 NaN B ... 424214.0 23527.0 399926.0 21551.0 0 0 0 0 0 NaN
28 2020-06-30 MS 27247.0 6786.0 779.0 3156.0 167.0 91.0 19388.0 A ... 299511.0 NaN NaN 27067.0 0 0 0 0 0 NaN
29 2020-06-30 MT 967.0 303.0 12.0 101.0 NaN NaN 642.0 C ... NaN NaN NaN 967.0 0 0 0 0 0 NaN
30 2020-06-30 NC 64670.0 17789.0 908.0 NaN NaN NaN 45538.0 A ... NaN NaN NaN 64670.0 0 0 0 0 0 NaN
31 2020-06-30 ND 3576.0 293.0 25.0 231.0 NaN NaN 3195.0 D ... 182283.0 NaN NaN 3576.0 0 0 0 0 0 NaN
32 2020-06-30 NE 19042.0 5226.0 121.0 1330.0 NaN NaN 13547.0 B ... NaN NaN NaN 19042.0 0 0 0 0 0 NaN
33 2020-06-30 NH 5760.0 958.0 34.0 565.0 NaN NaN 4435.0 B ... NaN NaN NaN 5760.0 0 0 0 0 0 NaN
34 2020-06-30 NJ 171667.0 126419.0 992.0 19847.0 211.0 174.0 30213.0 A+ ... NaN NaN NaN 171667.0 0 0 0 0 0 NaN
35 2020-06-30 NM 11982.0 6193.0 119.0 1876.0 NaN NaN 5296.0 B ... NaN NaN NaN 11982.0 0 0 0 0 0 NaN
36 2020-06-30 NV 18456.0 17260.0 593.0 NaN 134.0 65.0 689.0 A+ ... 322944.0 NaN NaN 18456.0 0 0 0 0 0 NaN
37 2020-06-30 NY 393454.0 298112.0 891.0 89995.0 217.0 137.0 70487.0 A ... NaN NaN NaN 393454.0 0 0 0 0 0 NaN
38 2020-06-30 OH 51789.0 NaN 722.0 7839.0 242.0 115.0 NaN B ... NaN NaN NaN 48222.0 0 0 0 0 0 NaN
39 2020-06-30 OK 13757.0 3285.0 315.0 1520.0 111.0 NaN 10085.0 A+ ... 343623.0 15029.0 327840.0 13757.0 0 0 0 0 0 NaN
40 2020-06-30 OR 8656.0 5727.0 149.0 1038.0 42.0 25.0 2722.0 A+ ... NaN NaN 228978.0 8265.0 0 0 0 0 0 NaN
41 2020-06-30 PA 86606.0 12405.0 634.0 NaN NaN 110.0 67552.0 A+ ... NaN NaN NaN 84130.0 0 0 0 0 0 NaN
43 2020-06-30 RI 16813.0 14232.0 74.0 2001.0 13.0 13.0 1631.0 A+ ... NaN NaN NaN 16813.0 0 0 0 0 0 NaN
44 2020-06-30 SC 36399.0 20189.0 1021.0 2854.0 NaN NaN 15471.0 A ... 380470.0 46401.0 334069.0 36297.0 0 0 0 0 0 NaN
45 2020-06-30 SD 6764.0 801.0 62.0 666.0 NaN NaN 5872.0 B ... NaN NaN NaN 6764.0 0 0 0 0 0 NaN
46 2020-06-30 TN 43509.0 15306.0 527.0 2665.0 NaN NaN 27599.0 B ... 792779.0 50413.0 742366.0 43161.0 0 0 0 0 0 NaN
47 2020-06-30 TX 159986.0 72744.0 6533.0 NaN NaN NaN 84818.0 B ... 1869282.0 NaN NaN NaN 0 0 0 0 0 NaN
48 2020-06-30 UT 22217.0 9647.0 250.0 1444.0 83.0 NaN 12398.0 A+ ... NaN NaN NaN 22217.0 0 0 0 0 0 NaN
49 2020-06-30 VA 62787.0 52944.0 902.0 8982.0 230.0 98.0 8080.0 A+ ... 642371.0 NaN NaN 60124.0 0 0 0 0 0 NaN
51 2020-06-30 VT 1208.0 199.0 16.0 NaN NaN NaN 953.0 B ... NaN NaN NaN 1208.0 0 0 0 0 0 NaN
52 2020-06-30 WA 32253.0 NaN 282.0 4323.0 NaN 50.0 NaN B ... NaN NaN NaN 32253.0 0 0 0 0 0 NaN
53 2020-06-30 WI 31662.0 8291.0 242.0 3446.0 79.0 NaN 22587.0 A+ ... NaN NaN NaN 28659.0 0 0 0 0 0 NaN
54 2020-06-30 WV 2905.0 540.0 27.0 NaN 10.0 3.0 2272.0 B ... NaN NaN NaN 2804.0 0 0 0 0 0 NaN

50 rows × 25 columns
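The index in the output above skips 3, 12, 27, 42 and 50, which is consistent with rows for US territories having been dropped from an alphabetical list of postal codes. Note also that the 50 rows shown include DC, so the preview holds 49 states plus DC. A minimal sketch of such a filter follows; the frame `raw`, the set `TERRITORIES` and the AS value are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the raw per-state frame (only the columns needed here).
# The AS (American Samoa) value is an illustrative placeholder.
raw = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-30"] * 5),
    "state": ["AK", "AL", "AR", "AS", "AZ"],
    "positive": [940.0, 38045.0, 20777.0, 0.0, 79215.0],
})

# Postal codes of US territories (assumed to be what the notebook filters out).
TERRITORIES = {"AS", "GU", "MP", "PR", "VI"}

# Keep only rows whose postal code is not a territory; the original index is preserved.
filtered = raw[~raw["state"].isin(TERRITORIES)]
print(filtered.index.tolist())  # [0, 1, 2, 4]
```

Calling `filtered.reset_index(drop=True)` afterwards would give the contiguous 0..49 index seen in the later outputs.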

# Add state-level data: total beds, beds per 1,000 people, population, abbreviation, and name:
all_cases.head(50)
date state abbrev population positive active hospitalizedCurrently hospitalizedCumulative inIcuCurrently onVentilatorCurrently ... negativeTestsViral positiveCasesViral commercialScore negativeRegularScore negativeScore positiveScore score grade bedsPerThousand total_beds
0 2020-06-30 Alaska AK 734002 940.0 400.0 18.0 NaN NaN 1.0 ... NaN NaN 0 0 0 0 0 NaN 2.2 1614.8044
1 2020-06-30 Alabama AL 4908621 38045.0 18229.0 776.0 2769.0 NaN NaN ... NaN 37536.0 0 0 0 0 0 NaN 3.1 15216.7251
2 2020-06-30 Arkansas AR 3038999 20777.0 5976.0 290.0 1413.0 NaN 67.0 ... NaN 20777.0 0 0 0 0 0 NaN 3.2 9724.7968
3 2020-06-30 Arizona AZ 7378494 79215.0 68172.0 2793.0 4736.0 683.0 455.0 ... NaN 78781.0 0 0 0 0 0 NaN 1.9 14019.1386
4 2020-06-30 California CA 39937489 222917.0 NaN 6466.0 NaN 1751.0 NaN ... NaN 222917.0 0 0 0 0 0 NaN 1.8 71887.4802
5 2020-06-30 Colorado CO 5845526 32511.0 26350.0 271.0 5442.0 NaN NaN ... NaN 29651.0 0 0 0 0 0 NaN 1.9 11106.4994
6 2020-06-30 Connecticut CT 3563077 46514.0 34139.0 98.0 10268.0 NaN NaN ... NaN 44534.0 0 0 0 0 0 NaN 2.0 7126.1540
7 2020-06-30 District of Columbia DC 720687 10327.0 8506.0 126.0 NaN 33.0 22.0 ... NaN 10327.0 0 0 0 0 0 NaN 4.4 3171.0228
8 2020-06-30 Delaware DE 982895 11474.0 4298.0 64.0 NaN 13.0 NaN ... NaN 10430.0 0 0 0 0 0 NaN 2.2 2162.3690
9 2020-06-30 Florida FL 21992985 152434.0 NaN NaN 14879.0 NaN NaN ... 2136944.0 152434.0 0 0 0 0 0 NaN 2.6 57181.7610
10 2020-06-30 Georgia GA 10736059 81291.0 NaN 1459.0 11051.0 NaN NaN ... 759664.0 81291.0 0 0 0 0 0 NaN 2.4 25766.5416
11 2020-06-30 Hawaii HI 1412687 900.0 160.0 NaN 111.0 NaN NaN ... 89677.0 900.0 0 0 0 0 0 NaN 1.9 2684.1053
12 2020-06-30 Iowa IA 3179849 29007.0 5080.0 133.0 NaN 34.0 20.0 ... NaN 29007.0 0 0 0 0 0 NaN 3.0 9539.5470
13 2020-06-30 Idaho ID 1826156 5752.0 1588.0 NaN 322.0 NaN NaN ... NaN 5212.0 0 0 0 0 0 NaN 1.9 3469.6964
14 2020-06-30 Illinois IL 12659682 144238.0 NaN 1560.0 NaN 401.0 185.0 ... NaN 143185.0 0 0 0 0 0 NaN 2.5 31649.2050
15 2020-06-30 Indiana IN 6745354 45594.0 8308.0 695.0 7065.0 272.0 97.0 ... NaN 45594.0 0 0 0 0 0 NaN 2.7 18212.4558
16 2020-06-30 Kansas KS 2910357 14443.0 13379.0 NaN 1152.0 NaN NaN ... NaN 14443.0 0 0 0 0 0 NaN 3.3 9604.1781
17 2020-06-30 Kentucky KY 4499692 15624.0 11069.0 408.0 2621.0 75.0 NaN ... NaN 15090.0 0 0 0 0 0 NaN 3.2 14399.0144
18 2020-06-30 Louisiana LA 4645184 58095.0 12649.0 781.0 NaN NaN 83.0 ... NaN 58095.0 0 0 0 0 0 NaN 3.3 15329.1072
19 2020-06-30 Massachusetts MA 6976597 108882.0 NaN 733.0 11337.0 120.0 63.0 ... NaN 103701.0 0 0 0 0 0 NaN 2.3 16046.1731
20 2020-06-30 Maryland MD 6083116 67559.0 59387.0 452.0 10844.0 152.0 NaN ... NaN 67559.0 0 0 0 0 0 NaN 1.9 11557.9204
21 2020-06-30 Maine ME 1345790 3253.0 502.0 29.0 348.0 9.0 4.0 ... 90532.0 2893.0 0 0 0 0 0 NaN 2.5 3364.4750
22 2020-06-30 Michigan MI 10045029 70728.0 13436.0 471.0 NaN 179.0 98.0 ... 974329.0 63870.0 0 0 0 0 0 NaN 2.5 25112.5725
23 2020-06-30 Minnesota MN 5700671 36303.0 3226.0 270.0 4054.0 136.0 NaN ... NaN 36303.0 0 0 0 0 0 NaN 2.5 14251.6775
24 2020-06-30 Missouri MO 6169270 21551.0 NaN 599.0 NaN NaN 66.0 ... 399926.0 21551.0 0 0 0 0 0 NaN 3.1 19124.7370
25 2020-06-30 Mississippi MS 2989260 27247.0 6786.0 779.0 3156.0 167.0 91.0 ... NaN 27067.0 0 0 0 0 0 NaN 4.0 11957.0400
26 2020-06-30 Montana MT 1086759 967.0 303.0 12.0 101.0 NaN NaN ... NaN 967.0 0 0 0 0 0 NaN 3.3 3586.3047
27 2020-06-30 North Carolina NC 10611862 64670.0 17789.0 908.0 NaN NaN NaN ... NaN 64670.0 0 0 0 0 0 NaN 2.1 22284.9102
28 2020-06-30 North Dakota ND 761723 3576.0 293.0 25.0 231.0 NaN NaN ... NaN 3576.0 0 0 0 0 0 NaN 4.3 3275.4089
29 2020-06-30 Nebraska NE 1952570 19042.0 5226.0 121.0 1330.0 NaN NaN ... NaN 19042.0 0 0 0 0 0 NaN 3.6 7029.2520
30 2020-06-30 New Hampshire NH 1371246 5760.0 958.0 34.0 565.0 NaN NaN ... NaN 5760.0 0 0 0 0 0 NaN 2.1 2879.6166
31 2020-06-30 New Jersey NJ 8936574 171667.0 126419.0 992.0 19847.0 211.0 174.0 ... NaN 171667.0 0 0 0 0 0 NaN 2.4 21447.7776
32 2020-06-30 New Mexico NM 2096640 11982.0 6193.0 119.0 1876.0 NaN NaN ... NaN 11982.0 0 0 0 0 0 NaN 1.8 3773.9520
33 2020-06-30 Nevada NV 3139658 18456.0 17260.0 593.0 NaN 134.0 65.0 ... NaN 18456.0 0 0 0 0 0 NaN 2.1 6593.2818
34 2020-06-30 New York NY 19440469 393454.0 298112.0 891.0 89995.0 217.0 137.0 ... NaN 393454.0 0 0 0 0 0 NaN 2.7 52489.2663
35 2020-06-30 Ohio OH 11747694 51789.0 NaN 722.0 7839.0 242.0 115.0 ... NaN 48222.0 0 0 0 0 0 NaN 2.8 32893.5432
36 2020-06-30 Oklahoma OK 3954821 13757.0 3285.0 315.0 1520.0 111.0 NaN ... 327840.0 13757.0 0 0 0 0 0 NaN 2.8 11073.4988
37 2020-06-30 Oregon OR 4301089 8656.0 5727.0 149.0 1038.0 42.0 25.0 ... 228978.0 8265.0 0 0 0 0 0 NaN 1.6 6881.7424
38 2020-06-30 Pennsylvania PA 12820878 86606.0 12405.0 634.0 NaN NaN 110.0 ... NaN 84130.0 0 0 0 0 0 NaN 2.9 37180.5462
39 2020-06-30 Rhode Island RI 1056161 16813.0 14232.0 74.0 2001.0 13.0 13.0 ... NaN 16813.0 0 0 0 0 0 NaN 2.1 2217.9381
40 2020-06-30 South Carolina SC 5210095 36399.0 20189.0 1021.0 2854.0 NaN NaN ... 334069.0 36297.0 0 0 0 0 0 NaN 2.4 12504.2280
41 2020-06-30 South Dakota SD 903027 6764.0 801.0 62.0 666.0 NaN NaN ... NaN 6764.0 0 0 0 0 0 NaN 4.8 4334.5296
42 2020-06-30 Tennessee TN 6897576 43509.0 15306.0 527.0 2665.0 NaN NaN ... 742366.0 43161.0 0 0 0 0 0 NaN 2.9 20002.9704
43 2020-06-30 Texas TX 29472295 159986.0 72744.0 6533.0 NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 2.3 67786.2785
44 2020-06-30 Utah UT 3282115 22217.0 9647.0 250.0 1444.0 83.0 NaN ... NaN 22217.0 0 0 0 0 0 NaN 1.8 5907.8070
45 2020-06-30 Virginia VA 8626207 62787.0 52944.0 902.0 8982.0 230.0 98.0 ... NaN 60124.0 0 0 0 0 0 NaN 2.1 18115.0347
46 2020-06-30 Vermont VT 628061 1208.0 199.0 16.0 NaN NaN NaN ... NaN 1208.0 0 0 0 0 0 NaN 2.1 1318.9281
47 2020-06-30 Washington WA 7797095 32253.0 NaN 282.0 4323.0 NaN 50.0 ... NaN 32253.0 0 0 0 0 0 NaN 1.7 13255.0615
48 2020-06-30 Wisconsin WI 5851754 31662.0 8291.0 242.0 3446.0 79.0 NaN ... NaN 28659.0 0 0 0 0 0 NaN 2.1 12288.6834
49 2020-06-30 West Virginia WV 1778070 2905.0 540.0 27.0 NaN 10.0 3.0 ... NaN 2804.0 0 0 0 0 0 NaN 3.8 6756.6660

50 rows × 29 columns
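In the output above, total_beds matches population / 1000 × bedsPerThousand (Alaska: 734002 / 1000 × 2.2 = 1614.8044). A minimal sketch of how such a merge could look is below; the lookup frame `state_info` and the variable names are hypothetical, and the values are copied from the rows shown above.

```python
import pandas as pd

# Two rows of the main dataset, keyed by postal code.
cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-30", "2020-06-30"]),
    "state": ["AK", "AL"],
    "positive": [940.0, 38045.0],
})

# Hypothetical lookup table with one row per state.
state_info = pd.DataFrame({
    "abbrev": ["AK", "AL"],
    "state": ["Alaska", "Alabama"],
    "population": [734002, 4908621],
    "bedsPerThousand": [2.2, 3.1],
})

# Rename the postal-code column, then left-merge so no case rows are lost.
# validate="many_to_one" raises if the lookup has duplicate abbreviations.
merged = cases.rename(columns={"state": "abbrev"}).merge(
    state_info, on="abbrev", how="left", validate="many_to_one"
)

# Derive the absolute number of beds from the per-1,000 rate.
merged["total_beds"] = merged["population"] / 1000 * merged["bedsPerThousand"]
print(merged[["state", "abbrev", "total_beds"]])
```

The `validate` argument is a cheap guard against accidentally multiplying rows during the merge.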

  • Load and clean JHU data
  • Merge JHU dataset with main dataset
#Load the Johns Hopkins data
jhu_df.tail(50)
LastUpdate ProvinceState Active Confirmed Deaths Recovered
5145 2020-06-19 Alaska 695.0 707.0 12.0 0.0
5146 2020-06-19 Arizona 42162.0 43445.0 1283.0 0.0
5147 2020-06-19 Arkansas 13720.0 13928.0 208.0 0.0
5148 2020-06-19 California 161731.0 167086.0 5355.0 0.0
5149 2020-06-19 Colorado 28248.0 29886.0 1638.0 0.0
5150 2020-06-19 Connecticut 41214.0 45440.0 4226.0 0.0
5151 2020-06-19 Delaware 10068.0 10499.0 431.0 0.0
5152 2020-06-19 District of Columbia 9376.0 9903.0 527.0 0.0
5153 2020-06-19 Florida 82865.0 85926.0 3061.0 0.0
5154 2020-06-19 Georgia 58307.0 60912.0 2605.0 0.0
5155 2020-06-19 Hawaii 745.0 762.0 17.0 0.0
5156 2020-06-19 Idaho 3654.0 3743.0 89.0 0.0
5157 2020-06-19 Illinois 128241.0 134778.0 6537.0 0.0
5158 2020-06-19 Indiana 38947.0 41438.0 2491.0 0.0
5159 2020-06-19 Iowa 24181.0 24861.0 680.0 0.0
5160 2020-06-19 Kansas 11502.0 11753.0 251.0 0.0
5161 2020-06-19 Kentucky 12677.0 13197.0 520.0 0.0
5162 2020-06-19 Louisiana 45572.0 48634.0 3062.0 0.0
5163 2020-06-19 Maine 2776.0 2878.0 102.0 0.0
5164 2020-06-19 Maryland 60213.0 63229.0 3016.0 0.0
5165 2020-06-19 Massachusetts 98653.0 106422.0 7769.0 0.0
5166 2020-06-19 Michigan 60737.0 66798.0 6061.0 0.0
5167 2020-06-19 Minnesota 30299.0 31675.0 1376.0 0.0
5168 2020-06-19 Mississippi 19703.0 20641.0 938.0 0.0
5169 2020-06-19 Missouri 16426.0 17371.0 945.0 0.0
5170 2020-06-19 Montana 635.0 655.0 20.0 0.0
5171 2020-06-19 Nebraska 17175.0 17414.0 239.0 0.0
5172 2020-06-19 Nevada 11694.0 12169.0 475.0 0.0
5173 2020-06-19 New Hampshire 5119.0 5450.0 331.0 0.0
5174 2020-06-19 New Jersey 155238.0 168107.0 12869.0 0.0
5175 2020-06-19 New Mexico 9697.0 10153.0 456.0 0.0
5176 2020-06-19 New York 354786.0 385760.0 30974.0 0.0
5177 2020-06-19 North Carolina 46972.0 48168.0 1196.0 0.0
5178 2020-06-19 North Dakota 3118.0 3193.0 75.0 0.0
5179 2020-06-19 Ohio 40489.0 43122.0 2633.0 0.0
5180 2020-06-19 Oklahoma 8989.0 9355.0 366.0 0.0
5181 2020-06-19 Oregon 6179.0 6366.0 187.0 0.0
5182 2020-06-19 Pennsylvania 78322.0 84683.0 6361.0 0.0
5183 2020-06-19 Rhode Island 15384.0 16269.0 885.0 0.0
5184 2020-06-19 South Carolina 20912.0 21533.0 621.0 0.0
5185 2020-06-19 South Dakota 6031.0 6109.0 78.0 0.0
5186 2020-06-19 Tennessee 32262.0 32770.0 508.0 0.0
5187 2020-06-19 Texas 99130.0 101259.0 2129.0 0.0
5188 2020-06-19 Utah 15687.0 15839.0 152.0 0.0
5189 2020-06-19 Vermont 1079.0 1135.0 56.0 0.0
5190 2020-06-19 Virginia 54652.0 56238.0 1586.0 0.0
5191 2020-06-19 Washington 25947.0 27192.0 1245.0 0.0
5192 2020-06-19 West Virginia 2330.0 2418.0 88.0 0.0
5193 2020-06-19 Wisconsin 23157.0 23876.0 719.0 0.0
5194 2020-06-19 Wyoming 1126.0 1144.0 18.0 0.0
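In the JHU rows above, Recovered is 0.0 throughout and Active equals Confirmed minus Deaths (Alaska: 707 − 12 = 695). The sketch below shows one way the cleaning and merge steps listed earlier could be done; the frame `main`, the column name `active_jhu` and the merge keys are assumptions for illustration.

```python
import pandas as pd

# Two JHU rows copied from the output above.
jhu = pd.DataFrame({
    "LastUpdate": ["2020-06-19", "2020-06-19"],
    "ProvinceState": ["Alaska", "Arizona"],
    "Active": [695.0, 42162.0],
    "Confirmed": [707.0, 43445.0],
    "Deaths": [12.0, 1283.0],
    "Recovered": [0.0, 0.0],
})
jhu["LastUpdate"] = pd.to_datetime(jhu["LastUpdate"])

# Consistency check: Active should equal Confirmed - Deaths - Recovered.
expected_active = jhu["Confirmed"] - jhu["Deaths"] - jhu["Recovered"]
active_is_consistent = bool((jhu["Active"] == expected_active).all())

# Hypothetical slice of the main dataset holding only the merge keys.
main = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-19", "2020-06-19"]),
    "state": ["Alaska", "Arizona"],
    "abbrev": ["AK", "AZ"],
})

# Align column names, then left-merge on date and state name.
jhu_clean = jhu.rename(
    columns={"LastUpdate": "date", "ProvinceState": "state", "Active": "active_jhu"}
)
combined = main.merge(
    jhu_clean[["date", "state", "active_jhu"]],
    on=["date", "state"],
    how="left",
    indicator=True,  # adds a _merge column showing which rows found a match
)
print(combined)
```

Rows where `_merge` is `left_only` would be dates or states that the JHU data does not cover.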
#Grab all historical data and ensure we have the 1st US case.
all_cases.tail()
date state abbrev population positive active hospitalizedCurrently hospitalizedCumulative inIcuCurrently onVentilatorCurrently ... negativeTestsViral positiveCasesViral commercialScore negativeRegularScore negativeScore positiveScore score grade bedsPerThousand total_beds
6029 2020-01-26 Washington WA 7797095 2.0 2.0 NaN NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 1.7 13255.0615
6030 2020-01-25 Washington WA 7797095 2.0 2.0 NaN NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 1.7 13255.0615
6031 2020-01-24 Washington WA 7797095 2.0 2.0 NaN NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 1.7 13255.0615
6032 2020-01-23 Washington WA 7797095 2.0 2.0 NaN NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 1.7 13255.0615
6033 2020-01-22 Washington WA 7797095 2.0 2.0 NaN NaN NaN NaN ... NaN NaN 0 0 0 0 0 NaN 1.7 13255.0615

5 rows × 29 columns
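The tail above ends with Washington on 2020-01-22. A quick way to confirm the earliest record and to look for gaps in the daily series is sketched below; the frame `history` is a hypothetical stand-in built from the five rows shown.

```python
import pandas as pd

# The five Washington rows from the output above.
history = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-26", "2020-01-25", "2020-01-24", "2020-01-23", "2020-01-22"]
    ),
    "state": ["Washington"] * 5,
    "positive": [2.0] * 5,
})

# Earliest record in the dataset.
first_row = history.sort_values("date").iloc[0]

# Calendar days between the first and last date that have no row at all.
full_range = pd.date_range(history["date"].min(), history["date"].max(), freq="D")
missing_days = full_range.difference(pd.DatetimeIndex(history["date"]))

print(first_row["date"].date(), first_row["state"], len(missing_days))
```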

Exploratory Data Analysis of the US Dataset

Basic checks on the dataset: validating the data type and completeness of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6034 entries, 0 to 6033
Data columns (total 29 columns):
date                      6034 non-null datetime64[ns]
state                     6034 non-null object
abbrev                    6034 non-null object
population                6034 non-null int64
positive                  6034 non-null float64
active                    6034 non-null float64
hospitalizedCurrently     3739 non-null float64
hospitalizedCumulative    3306 non-null float64
inIcuCurrently            1933 non-null float64
onVentilatorCurrently     1720 non-null float64
recovered                 6034 non-null float64
dataQualityGrade          5100 non-null object
lastUpdateEt              5679 non-null object
dateModified              5679 non-null object
checkTimeEt               5679 non-null object
death                     6034 non-null float64
hospitalized              3306 non-null float64
totalTestsViral           1640 non-null float64
positiveTestsViral        555 non-null float64
negativeTestsViral        557 non-null float64
positiveCasesViral        3206 non-null float64
commercialScore           6034 non-null int64
negativeRegularScore      6034 non-null int64
negativeScore             6034 non-null int64
positiveScore             6034 non-null int64
score                     6034 non-null int64
grade                     0 non-null float64
bedsPerThousand           6034 non-null float64
total_beds                6034 non-null float64
dtypes: datetime64[ns](1), float64(16), int64(6), object(6)
memory usage: 1.4+ MB
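The summary above shows `date` stored as datetime64 and `grade` with 0 non-null values. The same checks can be made programmatically, as in this minimal sketch; the small frame `df` and the `expected_kinds` mapping are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with values from the outputs above.
df = pd.DataFrame({
    "date": ["2020-06-30", "2020-06-30"],
    "state": ["Alaska", "Alabama"],
    "population": [734002, 4908621],
    "positive": [940.0, 38045.0],
    "grade": [np.nan, np.nan],
})

# Parse the date strings into datetime64 values.
df["date"] = pd.to_datetime(df["date"])

# Columns with no data at all carry no information and can be dropped.
empty_cols = [c for c in df.columns if df[c].isna().all()]
df = df.drop(columns=empty_cols)

# Expected NumPy dtype kinds: M = datetime, i = integer, f = float.
expected_kinds = {"date": "M", "population": "i", "positive": "f"}
mismatched = {
    col: str(df[col].dtype)
    for col, kind in expected_kinds.items()
    if df[col].dtype.kind != kind
}
print("dropped:", empty_cols, "mismatched:", mismatched)
```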
# We checked above that the data types are correct; now review the combined, cleaned, validated, and merged dataset for all 50 states:
covid_df.head(50)
date state abbrev population positive active hospitalizedCurrently hospitalizedCumulative inIcuCurrently onVentilatorCurrently ... negativeTestsViral positiveCasesViral commercialScore negativeRegularScore negativeScore positiveScore score grade bedsPerThousand total_beds
0 2020-06-30 Alaska AK 734002 940.000 400.000 18.000 nan nan 1.000 ... nan nan 0 0 0 0 0 nan 2.200 1614.804
1 2020-06-30 Alabama AL 4908621 38045.000 18229.000 776.000 2769.000 nan nan ... nan 37536.000 0 0 0 0 0 nan 3.100 15216.725
2 2020-06-30 Arkansas AR 3038999 20777.000 5976.000 290.000 1413.000 nan 67.000 ... nan 20777.000 0 0 0 0 0 nan 3.200 9724.797
3 2020-06-30 Arizona AZ 7378494 79215.000 68172.000 2793.000 4736.000 683.000 455.000 ... nan 78781.000 0 0 0 0 0 nan 1.900 14019.139
4 2020-06-30 California CA 39937489 222917.000 216937.000 6466.000 nan 1751.000 nan ... nan 222917.000 0 0 0 0 0 nan 1.800 71887.480
5 2020-06-30 Colorado CO 5845526 32511.000 26350.000 271.000 5442.000 nan nan ... nan 29651.000 0 0 0 0 0 nan 1.900 11106.499
6 2020-06-30 Connecticut CT 3563077 46514.000 34139.000 98.000 10268.000 nan nan ... nan 44534.000 0 0 0 0 0 nan 2.000 7126.154
7 2020-06-30 District of Columbia DC 720687 10327.000 8506.000 126.000 nan 33.000 22.000 ... nan 10327.000 0 0 0 0 0 nan 4.400 3171.023
8 2020-06-30 Delaware DE 982895 11474.000 4298.000 64.000 nan 13.000 nan ... nan 10430.000 0 0 0 0 0 nan 2.200 2162.369
9 2020-06-30 Florida FL 21992985 152434.000 148830.000 nan 14879.000 nan nan ... 2136944.000 152434.000 0 0 0 0 0 nan 2.600 57181.761
10 2020-06-30 Georgia GA 10736059 81291.000 78486.000 1459.000 11051.000 nan nan ... 759664.000 81291.000 0 0 0 0 0 nan 2.400 25766.542
11 2020-06-30 Hawaii HI 1412687 900.000 160.000 nan 111.000 nan nan ... 89677.000 900.000 0 0 0 0 0 nan 1.900 2684.105
12 2020-06-30 Iowa IA 3179849 29007.000 5080.000 133.000 nan 34.000 20.000 ... nan 29007.000 0 0 0 0 0 nan 3.000 9539.547
13 2020-06-30 Idaho ID 1826156 5752.000 1588.000 nan 322.000 nan nan ... nan 5212.000 0 0 0 0 0 nan 1.900 3469.696
14 2020-06-30 Illinois IL 12659682 144238.000 137114.000 1560.000 nan 401.000 185.000 ... nan 143185.000 0 0 0 0 0 nan 2.500 31649.205
15 2020-06-30 Indiana IN 6745354 45594.000 8308.000 695.000 7065.000 272.000 97.000 ... nan 45594.000 0 0 0 0 0 nan 2.700 18212.456
16 2020-06-30 Kansas KS 2910357 14443.000 13379.000 nan 1152.000 nan nan ... nan 14443.000 0 0 0 0 0 nan 3.300 9604.178
17 2020-06-30 Kentucky KY 4499692 15624.000 11069.000 408.000 2621.000 75.000 nan ... nan 15090.000 0 0 0 0 0 nan 3.200 14399.014
18 2020-06-30 Louisiana LA 4645184 58095.000 12649.000 781.000 nan nan 83.000 ... nan 58095.000 0 0 0 0 0 nan 3.300 15329.107
19 2020-06-30 Massachusetts MA 6976597 108882.000 100828.000 733.000 11337.000 120.000 63.000 ... nan 103701.000 0 0 0 0 0 nan 2.300 16046.173
20 2020-06-30 Maryland MD 6083116 67559.000 59387.000 452.000 10844.000 152.000 nan ... nan 67559.000 0 0 0 0 0 nan 1.900 11557.920
21 2020-06-30 Maine ME 1345790 3253.000 502.000 29.000 348.000 9.000 4.000 ... 90532.000 2893.000 0 0 0 0 0 nan 2.500 3364.475
22 2020-06-30 Michigan MI 10045029 70728.000 13436.000 471.000 nan 179.000 98.000 ... 974329.000 63870.000 0 0 0 0 0 nan 2.500 25112.572
23 2020-06-30 Minnesota MN 5700671 36303.000 3226.000 270.000 4054.000 136.000 nan ... nan 36303.000 0 0 0 0 0 nan 2.500 14251.678
24 2020-06-30 Missouri MO 6169270 21551.000 20536.000 599.000 nan nan 66.000 ... 399926.000 21551.000 0 0 0 0 0 nan 3.100 19124.737
25 2020-06-30 Mississippi MS 2989260 27247.000 6786.000 779.000 3156.000 167.000 91.000 ... nan 27067.000 0 0 0 0 0 nan 4.000 11957.040
26 2020-06-30 Montana MT 1086759 967.000 303.000 12.000 101.000 nan nan ... nan 967.000 0 0 0 0 0 nan 3.300 3586.305
27 2020-06-30 North Carolina NC 10611862 64670.000 17789.000 908.000 nan nan nan ... nan 64670.000 0 0 0 0 0 nan 2.100 22284.910
28 2020-06-30 North Dakota ND 761723 3576.000 293.000 25.000 231.000 nan nan ... nan 3576.000 0 0 0 0 0 nan 4.300 3275.409
29 2020-06-30 Nebraska NE 1952570 19042.000 5226.000 121.000 1330.000 nan nan ... nan 19042.000 0 0 0 0 0 nan 3.600 7029.252
30 2020-06-30 New Hampshire NH 1371246 5760.000 958.000 34.000 565.000 nan nan ... nan 5760.000 0 0 0 0 0 nan 2.100 2879.617
31 2020-06-30 New Jersey NJ 8936574 171667.000 126419.000 992.000 19847.000 211.000 174.000 ... nan 171667.000 0 0 0 0 0 nan 2.400 21447.778
32 2020-06-30 New Mexico NM 2096640 11982.000 6193.000 119.000 1876.000 nan nan ... nan 11982.000 0 0 0 0 0 nan 1.800 3773.952
33 2020-06-30 Nevada NV 3139658 18456.000 17260.000 593.000 nan 134.000 65.000 ... nan 18456.000 0 0 0 0 0 nan 2.100 6593.282
34 2020-06-30 New York NY 19440469 393454.000 298112.000 891.000 89995.000 217.000 137.000 ... nan 393454.000 0 0 0 0 0 nan 2.700 52489.266
35 2020-06-30 Ohio OH 11747694 51789.000 48926.000 722.000 7839.000 242.000 115.000 ... nan 48222.000 0 0 0 0 0 nan 2.800 32893.543
36 2020-06-30 Oklahoma OK 3954821 13757.000 3285.000 315.000 1520.000 111.000 nan ... 327840.000 13757.000 0 0 0 0 0 nan 2.800 11073.499
37 2020-06-30 Oregon OR 4301089 8656.000 5727.000 149.000 1038.000 42.000 25.000 ... 228978.000 8265.000 0 0 0 0 0 nan 1.600 6881.742
38 2020-06-30 Pennsylvania PA 12820878 86606.000 12405.000 634.000 nan nan 110.000 ... nan 84130.000 0 0 0 0 0 nan 2.900 37180.546
39 2020-06-30 Rhode Island RI 1056161 16813.000 14232.000 74.000 2001.000 13.000 13.000 ... nan 16813.000 0 0 0 0 0 nan 2.100 2217.938
40 2020-06-30 South Carolina SC 5210095 36399.000 20189.000 1021.000 2854.000 nan nan ... 334069.000 36297.000 0 0 0 0 0 nan 2.400 12504.228
41 2020-06-30 South Dakota SD 903027 6764.000 801.000 62.000 666.000 nan nan ... nan 6764.000 0 0 0 0 0 nan 4.800 4334.530
42 2020-06-30 Tennessee TN 6897576 43509.000 15306.000 527.000 2665.000 nan nan ... 742366.000 43161.000 0 0 0 0 0 nan 2.900 20002.970
43 2020-06-30 Texas TX 29472295 159986.000 72744.000 6533.000 nan nan nan ... nan nan 0 0 0 0 0 nan 2.300 67786.278
44 2020-06-30 Utah UT 3282115 22217.000 9647.000 250.000 1444.000 83.000 nan ... nan 22217.000 0 0 0 0 0 nan 1.800 5907.807
45 2020-06-30 Virginia VA 8626207 62787.000 52944.000 902.000 8982.000 230.000 98.000 ... nan 60124.000 0 0 0 0 0 nan 2.100 18115.035
46 2020-06-30 Vermont VT 628061 1208.000 199.000 16.000 nan nan nan ... nan 1208.000 0 0 0 0 0 nan 2.100 1318.928
47 2020-06-30 Washington WA 7797095 32253.000 30933.000 282.000 4323.000 nan 50.000 ... nan 32253.000 0 0 0 0 0 nan 1.700 13255.061
48 2020-06-30 Wisconsin WI 5851754 31662.000 8291.000 242.000 3446.000 79.000 nan ... nan 28659.000 0 0 0 0 0 nan 2.100 12288.683
49 2020-06-30 West Virginia WV 1778070 2905.000 540.000 27.000 nan 10.000 3.000 ... nan 2804.000 0 0 0 0 0 nan 3.800 6756.666

50 rows × 29 columns

The NaN values may indicate that there were few or no COVID-19 patients on these dates, or that a state did not report the metric. We further analyse the summary statistics of the dataset columns to check data integrity and accuracy.
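To see how much each column is affected, the share of missing values can be computed per column. This is a minimal sketch on four rows copied from the 2020-06-30 output (AK, AL, AR, AZ); the frame name `sample` is hypothetical.

```python
import numpy as np
import pandas as pd

# Four rows from the 2020-06-30 snapshot shown earlier.
sample = pd.DataFrame({
    "abbrev": ["AK", "AL", "AR", "AZ"],
    "hospitalizedCurrently": [18.0, 776.0, 290.0, 2793.0],
    "inIcuCurrently": [np.nan, np.nan, np.nan, 683.0],
    "onVentilatorCurrently": [1.0, np.nan, 67.0, 455.0],
})

# Fraction of missing values per column, most affected first.
missing_share = (
    sample.drop(columns="abbrev").isna().mean().sort_values(ascending=False)
)
print(missing_share)
```

Filling NaN with 0 would treat "not reported" as "no patients", so it is usually safer to leave the gaps in place until the reason for them is known.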

# Validate the data with count, mean, standard deviation, min/max, and quartiles:
covid_df.describe()
# TODO rounding up the numbers
population positive active hospitalizedCurrently hospitalizedCumulative inIcuCurrently onVentilatorCurrently recovered death hospitalized ... negativeTestsViral positiveCasesViral commercialScore negativeRegularScore negativeScore positiveScore score grade bedsPerThousand total_beds
count 6034.000 6034.000 6034.000 3739.000 3306.000 1933.000 1720.000 6034.000 6034.000 3306.000 ... 557.000 3206.000 6034.000 6034.000 6034.000 6034.000 6034.000 0.000 6034.000 6034.000
mean 6542177.949 21664.959 19012.828 1016.199 4420.804 435.195 221.276 4634.585 1122.867 4420.804 ... 304052.707 32743.682 0.000 0.000 0.000 0.000 0.000 nan 2.626 15805.406
std 7386887.293 47418.576 42396.431 1914.007 12997.569 686.535 325.642 11309.159 2952.200 12997.569 ... 401425.780 57145.293 0.000 0.000 0.000 0.000 0.000 nan 0.744 16159.400
min 567025.000 0.000 0.000 1.000 0.000 2.000 0.000 0.000 0.000 0.000 ... 17.000 0.000 0.000 0.000 0.000 0.000 0.000 nan 1.600 1318.928
25% 1778070.000 654.250 565.500 119.000 225.250 81.000 35.000 0.000 13.000 225.250 ... 50591.000 5129.750 0.000 0.000 0.000 0.000 0.000 nan 2.100 3773.952
50% 4499692.000 5309.000 4671.000 402.000 997.000 179.000 93.000 255.500 152.500 997.000 ... 148910.000 14087.000 0.000 0.000 0.000 0.000 0.000 nan 2.500 11557.920
75% 7797095.000 21609.000 17784.750 1009.500 3335.500 472.000 244.000 3252.000 806.750 3335.500 ... 399926.000 36296.000 0.000 0.000 0.000 0.000 0.000 nan 3.100 19124.737
max 39937489.000 393454.000 356899.000 18825.000 89995.000 5225.000 2425.000 84818.000 24855.000 89995.000 ... 2136944.000 393454.000 0.000 0.000 0.000 0.000 0.000 nan 4.800 71887.480

8 rows × 22 columns
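In the summary above, commercialScore, negativeRegularScore, negativeScore, positiveScore and score are 0 in every statistic. A short sketch for rounding the summary and listing such constant columns follows; the frame `sample` is a hypothetical three-row stand-in.

```python
import pandas as pd

# Three rows from the snapshot shown earlier, plus two of the all-zero columns.
sample = pd.DataFrame({
    "positive": [940.0, 38045.0, 20777.0],
    "commercialScore": [0, 0, 0],
    "score": [0, 0, 0],
})

# Round the summary statistics to one decimal place for readability.
summary = sample.describe().round(1)

# Columns with at most one distinct value cannot help explain anything.
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=True) <= 1]
print(summary)
print("constant columns:", constant_cols)
```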

#final_100k_last_month.head()
# Review the output for per capita measures:
final_100k_last_month.describe()
positive_100k active_100k recovered_100k death_100k hospitalizedCumulative_100k inIcuCurrently_100k onVentilatorCurrently_100k BedsPer100k
count 61.000 61.000 61.000 61.000 61.000 62.000 62.000 62.000
mean 358.759 336.008 170.212 17.931 34.329 113.658 62.620 13440.000
std 65.620 442.921 105.723 7.283 42.821 26.916 13.514 0.000
min 245.203 -2213.482 35.481 4.880 -93.926 70.613 39.353 13440.000
25% 308.315 292.339 107.989 12.184 21.638 94.079 53.461 13440.000
50% 344.558 332.717 147.227 17.253 25.122 111.563 62.120 13440.000
75% 405.031 370.778 211.312 23.811 29.823 126.991 74.683 13440.000
max 544.349 2291.210 626.665 33.917 246.371 167.561 94.521 13440.000
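A per-100k measure is the raw count divided by population and multiplied by 100,000. The summary above contains negative minima (active_100k, hospitalizedCumulative_100k) and a BedsPer100k column with zero standard deviation, which may be worth flagging before plotting. The sketch below is one way to compute and check such rates; the helper `per_100k` and the frame `states` are hypothetical.

```python
import pandas as pd

def per_100k(df, columns, population_col="population"):
    """Return a frame with <col>_100k = col / population * 100,000."""
    out = pd.DataFrame(index=df.index)
    for col in columns:
        out[f"{col}_100k"] = df[col] / df[population_col] * 100_000
    return out

# Two rows from the state snapshot shown earlier.
states = pd.DataFrame({
    "state": ["Alaska", "Alabama"],
    "population": [734002, 4908621],
    "positive": [940.0, 38045.0],
    "active": [400.0, 18229.0],
})

rates = per_100k(states, ["positive", "active"])

# Rows where any rate is negative deserve a closer look.
negative_rows = rates[(rates < 0).any(axis=1)]
print(rates.round(3))
print("rows with negative rates:", len(negative_rows))
```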

Graphical Exploratory Analysis

Plotting histograms, scatterplots and boxplots to assess the distribution of the entire US dataset.
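A minimal sketch of the three plot types mentioned above, using six rows from the 2020-06-30 snapshot (AK, AL, AR, AZ, CA, CO). The frame `sample`, the bin count and the axis labels are assumptions, not the notebook's own settings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Six rows from the state snapshot shown earlier.
sample = pd.DataFrame({
    "positive": [940.0, 38045.0, 20777.0, 79215.0, 222917.0, 32511.0],
    "hospitalizedCurrently": [18.0, 776.0, 290.0, 2793.0, 6466.0, 271.0],
})

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: how current hospitalizations are distributed across states.
ax_hist.hist(sample["hospitalizedCurrently"].dropna(), bins=5)
ax_hist.set_xlabel("hospitalizedCurrently")
ax_hist.set_ylabel("No. states")

# Boxplot: median, quartiles and outliers of the same column.
ax_box.boxplot(sample["hospitalizedCurrently"].dropna())
ax_box.set_ylabel("hospitalizedCurrently")

# Scatterplot: relationship between positive cases and hospitalizations.
ax_scatter.scatter(sample["positive"], sample["hospitalizedCurrently"])
ax_scatter.set_xlabel("positive")
ax_scatter.set_ylabel("hospitalizedCurrently")

fig.tight_layout()
```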

#Validate all US data:
timeseries_usa_df.tail()
date positive_100k active_100k recovered_100k death_100k hospitalizedCurrently_100k inIcuCurrently_100k onVentilatorCurrently_100k BedsPer100k
156 2020-06-26 34335.924 20098.997 12643.998 1592.929 404.115 67.051 34.318 13440.000
157 2020-06-27 34829.638 20417.559 12812.241 1599.839 407.257 68.533 35.118 13440.000
158 2020-06-28 35334.565 20809.528 12921.408 1603.630 402.011 65.968 33.930 13440.000
159 2020-06-29 35834.140 20955.190 13269.151 1609.799 409.624 66.395 32.901 13440.000
160 2020-06-30 36369.224 21135.140 13616.885 1617.200 423.406 66.047 33.482 13440.000
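The magnitudes above (for example positive_100k of 36369.224 on 2020-06-30) are much larger than any single state's rate, which may mean that state-level rates were summed rather than recomputed from national totals. The sketch below contrasts the two calculations on two states from the snapshot; it is an illustration of the arithmetic, not a statement about how the notebook computed these columns.

```python
import pandas as pd

# Two rows from the state snapshot shown earlier.
states = pd.DataFrame({
    "state": ["Alaska", "California"],
    "population": [734002, 39937489],
    "positive": [940.0, 222917.0],
})

# Per-state rates per 100,000 residents.
state_rates = states["positive"] / states["population"] * 100_000

# Summing rates adds up numbers with different denominators.
summed_rates = state_rates.sum()

# A combined rate divides total cases by total population.
combined_rate = states["positive"].sum() / states["population"].sum() * 100_000

print(round(summed_rates, 1), round(combined_rate, 1))
```

The combined rate always lies between the smallest and largest state rate, whereas the sum grows with the number of states included.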

Analysis of Hospitalizations by State
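The per-state sections below repeat the same kind of time-series plot. A small helper can keep titles and axis labels consistent; this is a sketch only, and the function name `plot_state_metric`, its parameters and the sample frame are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_state_metric(df, state, column, ylabel):
    """Plot one metric over time for one state and return the Axes."""
    subset = df[df["state"] == state].sort_values("date")
    fig, ax = plt.subplots(figsize=(16, 12))
    ax.plot(subset["date"], subset[column], linewidth=4.7, color="r")
    ax.set_title(f"{column} in {state}", fontsize=23)
    ax.set_xlabel("Date")
    ax.set_ylabel(ylabel)
    return ax

# The five Washington rows from the historical output shown earlier.
history = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-26", "2020-01-25", "2020-01-24", "2020-01-23", "2020-01-22"]
    ),
    "state": ["Washington"] * 5,
    "positive": [2.0] * 5,
})

ax = plot_state_metric(history, "Washington", "positive", "No. Patients")
```

Calling the helper once per state and column would replace the copied plotting cells and avoid label mismatches between plots.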

New York

[4 plots; y-axis label: 'No. Patients']

Alabama

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Arizona

[5 plots; y-axis labels: 'No. Patients' (4), '% Positive Cases in Hospital' (1)]

Arkansas

[3 plots; y-axis label: 'No. Patients']

California

[4 plots; y-axis label: 'No. Patients']

Colorado

[3 plots; y-axis labels: 'No. Patients' (1), 'No. Killed' (2)]

Connecticut

[3 plots; y-axis labels: 'No. Patients' (1), 'No. Killed' (2)]

Delaware

[2 plots; y-axis labels: 'No. Patients' (1), 'No. Killed' (1)]

Florida

[2 plots; y-axis label: 'No. Patients']
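# TODO fix legend/axis/plot altogether
# Time series plot of cumulative positive viral tests in Florida.
# fl is assumed to be the Florida subset of the main DataFrame.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16, 12))
ax.plot(fl.date, fl.positiveTestsViral, linewidth=4.7, color='r')
ax.set_title('Cumulative Number of Positive Viral Tests in Florida', fontsize=23)
ax.set_xlabel('Date')
ax.set_ylabel('No. Patients')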
[2 plots; y-axis labels: 'No. Patients' (1), '% Infected' (1)]

Georgia

[4 plots; y-axis labels: 'No. Patients' (3), '% Infection Rate' (1)]

Hawaii

[6 plots; y-axis labels: 'No. Patients' (1), 'No. Killed' (4), '% Infected' (1)]

Idaho

[3 plots; y-axis labels: 'No. Patients' (2), 'No. Killed' (1)]

Iowa

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Kansas

[2 plots; y-axis label: 'No. Patients']

Kentucky

[3 plots; y-axis label: 'No. Patients']

Louisiana

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Maine

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Maryland

[5 plots; y-axis labels: 'No. Patients' (4), 'No. Killed' (1)]

Massachusetts

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Michigan

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Minnesota

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Mississippi

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Missouri

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Montana

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Nebraska

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Nevada

[3 plots; y-axis label: 'No. Patients']

New Hampshire

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

New Jersey

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

New Mexico

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

North Carolina

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Ohio

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

Oregon

[4 plots; y-axis labels: 'No. Patients' (3), 'No. Killed' (1)]

South Carolina

Texas

Mississippi

Utah

Oklahoma

Assessing Correlation of Independent Variables
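A minimal sketch of a pairwise correlation check, using six rows from the 2020-06-30 snapshot in which all three hospital columns are reported (AZ, IN, MS, NJ, NY, VA). The frame `sample` and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rows with all three hospital metrics present in the snapshot shown earlier.
sample = pd.DataFrame({
    "hospitalizedCurrently": [2793.0, 695.0, 779.0, 992.0, 891.0, 902.0],
    "inIcuCurrently": [683.0, 272.0, 167.0, 211.0, 217.0, 230.0],
    "onVentilatorCurrently": [455.0, 97.0, 91.0, 174.0, 137.0, 98.0],
})

# Pearson correlation matrix (the pandas default).
corr = sample.corr()

# Keep each pair once by masking everything except the upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Pairs above the (assumed) threshold are candidates for collinearity.
high_pairs = pairs[pairs.abs() > 0.8]
print(corr.round(3))
print(high_pairs)
```

With only six rows and one state (AZ) far above the rest, any coefficient here is driven largely by that one point, so the full dataset should be used for real conclusions.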

Build Model for the Dependent Variable

  • To be used to predict current hospitalizations (hospitalizedCurrently)
  • More complete data for inIcuCurrently and onVentilatorCurrently would allow us to predict those numbers as well.
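As an illustration only, the sketch below fits an ordinary least squares line that predicts hospitalizedCurrently from active, using six rows of the 2020-06-30 snapshot (AK, AL, AR, AZ, CO, CT). The choice of predictor, the use of NumPy's least squares solver and the frame `train` are all assumptions; the notebook does not say which model or features it uses.

```python
import numpy as np
import pandas as pd

# Six rows from the state snapshot shown earlier.
train = pd.DataFrame({
    "active": [400.0, 18229.0, 5976.0, 68172.0, 26350.0, 34139.0],
    "hospitalizedCurrently": [18.0, 776.0, 290.0, 2793.0, 271.0, 98.0],
})

# Rows with missing predictor or target cannot be used for fitting.
train = train.dropna(subset=["active", "hospitalizedCurrently"])

# Design matrix with an intercept column followed by the predictor.
X = np.column_stack([np.ones(len(train)), train["active"].to_numpy()])
y = train["hospitalizedCurrently"].to_numpy()

# Ordinary least squares: minimises the sum of squared residuals.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
residuals = y - predicted

print("intercept, slope:", coef)
print("residuals:", residuals.round(1))
```

The same design matrix could be extended with further columns, and the target swapped for inIcuCurrently or onVentilatorCurrently once those columns are more complete.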